Take Home Exercise 3

Uncover illegal, unreported, and unregulated (IUU) fishing activities through visual analytics

Author

Fangxian

Published

June 4, 2023

Modified

June 18, 2023

Task

Using the data provided by the VAST Challenge, we look into Mini-Challenge 3 (MC3) to identify companies possibly engaged in illegal, unreported, and unregulated (IUU) fishing.

Data Wrangling

Load Packages

Show the code
pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork, graphlayouts, ggforce,skimr,tidytext, tidyverse,igraph, topicmodels,tm)

Data Import

In the code chunk below, fromJSON() of the jsonlite package is used to import MC3.json into the R environment.

Show the code
mc3_data <- fromJSON("data/MC3.json")

On examining the data, we treat the network as an undirected graph, so the in-degree and out-degree of the nodes are not analysed.

Extracting edges

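The code chunk below extracts the links, removes duplicate rows, converts the source, target and type fields to character, counts the rows for each source-target-type combination as weights, and drops self-loops.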

Show the code
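mc3_edges <- as_tibble(mc3_data$links) %>%
  distinct() %>%                      # drop exact duplicate links
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  group_by(source, target, type) %>%
  # duplicates were already dropped above, so weights is 1 for every row here
  summarise(weights = n()) %>%
  filter(source != target) %>%        # remove self-loops
  ungroup()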
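
Because distinct() runs before the grouping, each source-target-type combination appears only once by the time n() is evaluated, which is consistent with every weight being 1 in the summary further down. The sketch below uses a small made-up edge list (toy_links and its values are hypothetical, not taken from MC3.json) to contrast that behaviour with counting duplicates directly via count(). Whether repeated links should carry a higher weight is a modelling choice that the original code does not make.

```r
library(dplyr)

# Hypothetical toy edge list with one repeated link (not MC3 data)
toy_links <- tibble(
  source = c("Company A", "Company A", "Company B"),
  target = c("Owner 1",   "Owner 1",   "Owner 2"),
  type   = "Beneficial Owner"
)

# Same order of operations as the chunk above: duplicates are dropped first,
# so every remaining combination is counted once
weights_after_distinct <- toy_links %>%
  distinct() %>%
  count(source, target, type, name = "weights")

# Alternative: count repeated links so the weight reflects multiplicity
weights_from_count <- toy_links %>%
  count(source, target, type, name = "weights")

weights_after_distinct$weights
weights_from_count$weights
```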

Extracting nodes

Show the code
mc3_nodes <- as_tibble(mc3_data$nodes) %>%
#  distinct()%>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
    select(id, country, type, revenue_omu, product_services)

Initial Data Exploration

Exploring the edges dataframe

In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame.

Show the code
skim(mc3_edges)
Data summary
Name mc3_edges
Number of rows 24036
Number of columns 4
_______________________
Column type frequency:
character 3
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1 6 700 0 12856 0
target 0 1 6 28 0 21265 0
type 0 1 16 16 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weights 0 1 1 0 1 1 1 1 1 ▁▁▇▁▁

The report above reveals that there are no missing values in any field.

In the code chunk below, datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table on the html document.

Show the code
DT::datatable(mc3_edges)

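The code chunks below plot the number of edges of each type, then count the number of companies each person beneficially owns and the number of people (beneficial owners and company contacts) linked to each company.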

Show the code
ggplot(data = mc3_edges,
       aes(x=type)) +
  geom_bar()

Show the code
unique_ids <- unique(mc3_edges$target)
num_unique_ids <- length(unique_ids)
num_unique_ids
[1] 21265
Show the code
Noofcompanies <- mc3_edges %>%
  group_by(target, source, type) %>%
  filter(type == "Beneficial Owner") %>%
  summarise(count=n()) %>%
  group_by(target)%>%
  summarise(count=sum(count))

psych::describe(Noofcompanies)
        vars     n   mean      sd median trimmed     mad min   max range skew
target*    1 15305 7653.0 4418.32   7653    7653 5672.43   1 15305 15304 0.00
count      2 15305    1.1    0.40      1       1    0.00   1     9     8 6.28
        kurtosis    se
target*    -1.20 35.71
count      61.69  0.00
Show the code
Noofowners <- mc3_edges %>%
  group_by(source, target, type) %>%
  summarise(count=n()) %>%
  group_by(source)%>%
  summarise(count=sum(count))

psych::describe(Noofowners)
        vars     n    mean      sd median trimmed     mad min   max range  skew
source*    1 12856 6428.50 3711.35 6428.5 6428.50 4765.08   1 12856 12855  0.00
count      2 12856    1.87    3.47    1.0    1.22    0.00   1   120   119 11.36
        kurtosis    se
source*    -1.20 32.73
count     215.82  0.03

The code chunk below shows the top 50 owners by number of companies owned. John Smith and Michael Johnson top the list with 9 companies each, which could be suspicious: it is unclear why one person would need so many companies.

Show the code
list_top_50 <- Noofcompanies %>%
  arrange(desc(count)) %>%
  top_n(50, wt = count) 

ggplot(data = list_top_50, 
       aes(x = reorder(target, -count), y = count)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 
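
One caveat with top_n(): it keeps ties, so when many owners share the same count the result can contain more than 50 rows. The sketch below illustrates this on made-up data (toy_counts is hypothetical) and compares it with slice_max(), which the dplyr documentation lists as the replacement for the superseded top_n().

```r
library(dplyr)

# Hypothetical counts with a tie for second place
toy_counts <- tibble(
  target = c("Owner 1", "Owner 2", "Owner 3", "Owner 4"),
  count  = c(9, 3, 3, 1)
)

# top_n() keeps both tied rows, so asking for 2 returns 3 rows
with_ties <- toy_counts %>% top_n(2, wt = count)

# slice_max() behaves the same way by default (with_ties = TRUE)
same_with_slice <- toy_counts %>% slice_max(count, n = 2)

# with_ties = FALSE returns exactly n rows
exactly_two <- toy_counts %>% slice_max(count, n = 2, with_ties = FALSE)
```

If exactly 50 bars are wanted in the chart, with_ties = FALSE is one option, at the cost of an arbitrary cut among owners with equal counts.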

Exploring the nodes dataframe

Show the code
skim(mc3_nodes)
Data summary
Name mc3_nodes
Number of rows 27622
Number of columns 5
_______________________
Column type frequency:
character 4
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1 6 64 0 22929 0
country 0 1 2 15 0 100 0
type 0 1 7 16 0 3 0
product_services 0 1 4 1737 0 3244 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
revenue_omu 21515 0.22 1822155 18184433 3652.23 7676.36 16210.68 48327.66 310612303 ▇▁▁▁▁

The report above reveals that the character fields have no missing values, but revenue_omu is missing for 21,515 of the 27,622 rows (complete rate 0.22).
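
A per-column count of missing values is a quick cross-check on the skim() report. The sketch below uses a small made-up data frame (toy_nodes and its values are hypothetical) and base R only; it reproduces the n_missing and complete_rate columns for that toy data.

```r
# Hypothetical node table: revenue is missing for three of the four rows
toy_nodes <- data.frame(
  id = c("Company A", "Company B", "Owner 1", "Owner 2"),
  type = c("Company", "Company", "Beneficial Owner", "Beneficial Owner"),
  revenue_omu = c(9056, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
n_missing <- colSums(is.na(toy_nodes))

# Share of non-missing values in each column
complete_rate <- 1 - n_missing / nrow(toy_nodes)

n_missing
complete_rate
```

On the real data, the same check would be colSums(is.na(mc3_nodes)).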

In the code chunk below, datatable() of DT package is used to display mc3_nodes tibble data frame as an interactive table on the html document.

Show the code
DT::datatable(mc3_nodes)

The code chunk below shows the distribution of nodes across the values of the type field.

Show the code
ggplot(data = mc3_nodes,
       aes(x = type)) +
  geom_bar()

The code chunk below checks the revenue distribution for each node type.

Show the code
ggplot(data = mc3_nodes,
       aes(x= type,
         y = revenue_omu)) +
  geom_boxplot()

We combine the nodes and edges data to find out more about the owner-company relationships.

Show the code
combined <- left_join(mc3_nodes,mc3_edges,
                  by=c("id"="source"))

The code chunk below identifies which of the owners with many companies are also linked to high revenue.

Show the code
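combined <- combined %>%
  # keep beneficial-ownership links only; nodes without a link have NA type.y and are dropped
  filter(type.y == "Beneficial Owner") %>%
  # select() replaces summarize(revenue_omu), which returned several rows per group
  select(target, type.y, id, country, type.x, product_services, revenue_omu)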

filtered_combined <- combined %>%
  filter(target %in% list_top_50$target)%>%
  arrange(desc(revenue_omu))

ggplot(data = filtered_combined, 
       aes(x = target, y = revenue_omu)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

Michael Johnson, Mark Miller and James Rodriguez stand out in the chart above. The code chunk below lists the businesses linked to them to see what generates the higher revenue.

Show the code
Top_3_Revenue<- combined %>%
  filter (target %in% c("Michael Johnson", "Mark Miller","James Rodriguez")) %>%
  arrange(desc(revenue_omu))

DT::datatable(Top_3_Revenue)

Insights

From the data table above, we see that Michael Johnson is involved in the fishing business and has many companies in different countries. FishEye could look further into his business landscape across these companies and his business activities to understand more.

Text Sensing with tidytext

Simple word count

The code chunk below counts the number of times the word fish appears in the product_services field.

Show the code
mc3_nodes %>% 
    mutate(n_fish = str_count(product_services, "fish")) 
# A tibble: 27,622 × 6
   id                          country type  revenue_omu product_services n_fish
   <chr>                       <chr>   <chr>       <dbl> <chr>             <int>
 1 Jones LLC                   ZH      Comp…  310612303. Automobiles           0
 2 Coleman, Hall and Lopez     ZH      Comp…  162734684. Passenger cars,…      0
 3 Aqua Advancements Sashimi … Oceanus Comp…  115004667. Holding firm wh…      0
 4 Makumba Ltd. Liability Co   Utopor… Comp…   90986413. Car service, ca…      0
 5 Taylor, Taylor and Farrell  ZH      Comp…   81466667. Fully electric …      0
 6 Harmon, Edwards and Bates   ZH      Comp…   75070435. Discount superm…      0
 7 Punjab s Marine conservati… Riodel… Comp…   72167572. Beef, pork, chi…      0
 8 Assam   Limited Liability … Utopor… Comp…   72162317. Power and Gas s…      0
 9 Ianira Starfish Sagl Import Rio Is… Comp…   68832979. Light commercia…      0
10 Moran, Lewis and Jimenez    ZH      Comp…   65592906. Automobiles, tr…      0
# ℹ 27,612 more rows
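
Note that str_count() with a plain pattern is case-sensitive and matches substrings, so "Fish" is not counted while "Shellfish" is. The sketch below shows the difference on made-up strings (services is hypothetical), using stringr's regex() helper for case-insensitive and whole-word matching.

```r
library(stringr)

# Hypothetical product descriptions
services <- c("Fish and fish products", "Shellfish", "Automobiles")

# Default: case-sensitive substring match
n_default <- str_count(services, "fish")

# Case-insensitive substring match
n_ignore_case <- str_count(services, regex("fish", ignore_case = TRUE))

# Case-insensitive match on the whole word only
n_whole_word <- str_count(services, regex("\\bfish\\b", ignore_case = TRUE))

n_default      # 1 1 0
n_ignore_case  # 2 1 0
n_whole_word   # 2 0 0
```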

Tokenisation

The word tokenisation has different meanings in different scientific domains. In text sensing, tokenisation is the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters like punctuation marks may be discarded. The tokens usually become the input for processes like parsing and text mining.

In the code chunk below, unnest_tokens() of tidytext is used to split the text in the product_services field into words.

Show the code
token_nodes <- mc3_nodes %>%
  unnest_tokens(word, 
                product_services)

The two basic arguments to unnest_tokens() used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (product_services, in this case).
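
To see what unnest_tokens() does to a single value, the sketch below runs it on two made-up rows (toy_nodes is hypothetical). With the default word tokeniser, text is lower-cased and punctuation is dropped, so a literal "character(0)" string ends up as the two tokens "character" and "0".

```r
library(dplyr)
library(tidytext)

# Hypothetical rows mimicking the id and product_services fields
toy_nodes <- tibble(
  id = c("Company A", "Company B"),
  product_services = c("Fish, shellfish & seafood", "character(0)")
)

# One row per token; the other columns (here id) are carried along
toy_tokens <- toy_nodes %>%
  unnest_tokens(word, product_services)

toy_tokens
```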

Show the code
token_nodes %>%
  count(word, sort = TRUE) %>%
  top_n(15, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # x is the word aesthetic and y the count; coord_flip() only swaps their positions
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")

The bar chart reveals that the unique words include some that are not useful for analysis, for instance “a” and “to”. In the world of text mining, these are called stop words. They should be removed from the analysis because they are fillers used to compose a sentence.

Using filter, we also discover many “character(0)” entries, which carry no meaning. Tokenisation splits each of them into the tokens “character” and “0”, so we replace both tokens with “NA”.

Removing stopwords

Show the code
token_nodes$word[token_nodes$word == "character"] <- "NA"
token_nodes$word[token_nodes$word == "0"] <- "NA"
Show the code
stopwords_removed <- token_nodes %>% 
  anti_join(stop_words, by = "word")
Show the code
stopwords_removed %>%
  count(word, sort = TRUE) %>%
  top_n(15, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # x is the word aesthetic and y the count; coord_flip() only swaps their positions
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")

Show the code
stopwords_removed %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  # drop the three most frequent tokens (the first three rows after sorting)
  filter(!word %in% head(word, 3)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # x is the word aesthetic and y the count; coord_flip() only swaps their positions
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")

Initial Network Visualization and Analysis

Building network model with tidygraph

From the text insights above, we are interested in the network of companies, and their beneficial owners, that list fish among their products and services.

Show the code
mc3_nodes_fish <- stopwords_removed %>%
  filter(word == "fish")
Show the code
mc3_edges_fish <- mc3_edges[mc3_edges$source %in% mc3_nodes_fish$id,] %>%
  filter(type == "Beneficial Owner")

id1 <- mc3_edges_fish %>%
  select(source) %>%
  rename(id = source) 
id2 <- mc3_edges_fish %>% 
  select(target) %>% 
  rename(id = target) 
mc3_nodes_fish <- rbind(id1, id2) %>%
  distinct() %>% 
  left_join(mc3_nodes_fish,
            by = "id",            # join key stated explicitly
            unmatched = "drop") 
Show the code
mc3_graph <- tbl_graph(nodes = mc3_nodes_fish,                        
                      edges = mc3_edges_fish,                        
                        directed = FALSE)  

mc3_graph<-mc3_graph%>%
  mutate(betweenness=centrality_betweenness())

mc3_graph
# A tbl_graph: 1190 nodes and 876 edges
#
# An unrooted forest with 314 trees
#
# A tibble: 1,190 × 6
  id                                 country type  revenue_omu word  betweenness
  <chr>                              <chr>   <chr>       <dbl> <chr>       <dbl>
1 Adams Group                        ZH      Comp…       9056. fish            0
2 Albertine Rift  NV Family          Marebak Comp…       9761. fish            3
3 Allen PLC                          ZH      Comp…      61582. fish            0
4 Ancla del Mar Pic Worldwide        Thessa… Comp…       4667. fish            3
5 Andhra Pradesh  OJSC Marine conse… Yggdra… Comp…      25758. fish            0
6 Andhra Pradesh  OJSC Marine conse… Yggdra… Comp…      25758. fish            0
# ℹ 1,184 more rows
#
# A tibble: 876 × 4
   from    to type             weights
  <int> <int> <chr>              <int>
1     1   321 Beneficial Owner       1
2     2   322 Beneficial Owner       1
3     2   323 Beneficial Owner       1
# ℹ 873 more rows

A histogram shows the distribution of the betweenness centrality computed by centrality_betweenness().

Show the code
ggplot(as.data.frame(mc3_graph),aes(x=betweenness))+
  geom_histogram(bins=10,fill="lightblue",colour="black")+
  ggtitle("Distribution of centrality betweenness")+
  theme(plot.title = element_text(hjust=0.5))
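
As a reading aid for the betweenness column: in an undirected graph, a node's betweenness is the number of shortest paths between other pairs of nodes that pass through it (with fractional credit when several shortest paths exist). In a star-shaped component, such as one company linked to three owners, the hub lies on all three owner-to-owner paths and the leaves lie on none. The sketch below checks this on a made-up graph; the node names are hypothetical.

```r
library(dplyr)
library(tidygraph)

# Hypothetical star: one company (node 1) linked to three owners
toy_nodes <- data.frame(
  id = c("Company A", "Owner 1", "Owner 2", "Owner 3"),
  stringsAsFactors = FALSE
)
toy_edges <- data.frame(from = c(1, 1, 1), to = c(2, 3, 4))

toy_graph <- tbl_graph(nodes = toy_nodes,
                       edges = toy_edges,
                       directed = FALSE) %>%
  mutate(betweenness = centrality_betweenness())

# Node table with the computed betweenness
toy_result <- as_tibble(toy_graph)
toy_result
```

The hub gets a betweenness of 3 and each leaf gets 0, so small integer values in the table above are what isolated company-owner stars would produce, while values of 50 or more point to nodes that connect much larger groups.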

Based on this distribution, we keep only the nodes with a betweenness centrality of at least 50 to understand their interactions.

Show the code
set.seed(1234)  # fixes the random start of the "fr" layout

# degree is computed on the full graph, before filtering
mc3_graph <- mc3_graph %>%
  mutate(degree = centrality_degree())

mc3_graph %>%
  filter(betweenness >= 50) %>%
  ggraph(layout = "fr") +
  geom_edge_link(edge_alpha = 0.5) +
  # constant colour and alpha are set outside aes() so they are not mapped as data
  geom_node_point(aes(size = betweenness),
                  colour = "lightblue",
                  alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  geom_node_text(aes(label = id, filter = betweenness >= 50 & degree > 0), repel = TRUE) +
  theme_graph()

Show the code
list_top_30 <- Noofowners %>%
  arrange(desc(count)) %>%
  top_n(30, wt = count) 

ggplot(data = list_top_30, 
       aes(x = reorder(source, -count), y = count)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

Topic Modelling

Show the code
corpus <- Corpus(VectorSource(stopwords_removed$word))

# Convert the corpus to a document-term matrix
dtm <- DocumentTermMatrix(corpus)

# Convert the document-term matrix to a tidy format
tidy_dtm <- tidy(dtm)
tidy_dtm
# A tibble: 50,357 × 3
   document term         count
   <chr>    <chr>        <dbl>
 1 1        automobiles      1
 2 2        passenger        1
 3 3        cars             1
 4 4        trucks           1
 5 5        vans             1
 6 6        buses            1
 7 7        holding          1
 8 8        firm             1
 9 9        subsidiaries     1
10 10       engaged          1
# ℹ 50,347 more rows
Show the code
# vocabulary <- tidy_dtm %>%
#   filter(count > 1)
# 
# str(tidy_dtm)
# head(tidy_dtm)
# 
# # Create the LDA model
# lda_model <- LDA(tidy_dtm, k = 5, control = list(seed = 1234)) 
Show the code
# topics <- tidy(lda_model, matrix = "beta")
# 
# topics
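
The commented-out LDA call above passes the tidy tibble to LDA(), whereas topicmodels::LDA() expects a document-term matrix. In addition, the corpus above is built from stopwords_removed$word, so each "document" holds a single token. One possible way forward, shown here only as a sketch on made-up data (toy_nodes, the choice of k = 2 and the seed are assumptions, not the author's method), is to treat each company as a document and build the matrix with tidytext's cast_dtm().

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical toy data: one row per company
toy_nodes <- tibble(
  id = c("Company A", "Company B", "Company C", "Company D"),
  product_services = c(
    "fresh fish and frozen seafood products",
    "canned tuna salmon and other fish",
    "automobiles trucks and passenger cars",
    "car engines and truck repair services"
  )
)

# One document per company: tokenise, drop stop words, count words per id
toy_dtm <- toy_nodes %>%
  unnest_tokens(word, product_services) %>%
  anti_join(stop_words, by = "word") %>%
  count(id, word) %>%
  cast_dtm(document = id, term = word, value = n)

# Fit a two-topic LDA model on the document-term matrix (k = 2 is arbitrary)
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# Per-topic word probabilities in tidy form: columns topic, term, beta
toy_topics <- tidy(toy_lda, matrix = "beta")
toy_topics
```

On the real data, documents left with no terms after stop-word removal would need to be dropped before fitting, since LDA() requires every row of the matrix to have at least one non-zero entry.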